This exercise relied on the Twitter API, which is no longer available. A new version of the academic API appears to have been made available again recently, but it is unclear how this will develop. We will therefore use Twitter data collected in 2020 for this exercise.

Introduction

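In this tutorial, you will learn how to use dictionary-based techniques to analyze text, work with common sentiment dictionaries, create your own dictionary, and apply the Lexicoder Sentiment Dictionary.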

Setup

The hands-on exercise for this week uses dictionary-based methods for filtering and scoring words. Dictionary-based methods use pre-generated lexicons, which are no more than a list of words with associated scores or variables measuring the valence of a particular word. In this sense, the exercise is not unlike our analysis of Edinburgh Book Festival event descriptions. There, we filtered descriptions based on the presence or absence of a word related to women or gender. We can understand this approach as a particularly simple type of “dictionary-based” method, in which our “dictionary” or “lexicon” contained just a few words related to gender.
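
As a minimal sketch of the idea, assuming a hypothetical three-word lexicon (toy_lexicon) and two made-up sentences that are not part of the exercise data, scoring text with a dictionary amounts to a join followed by a sum:

```r
library(dplyr)
library(tidytext)

# hypothetical lexicon: words with an associated valence score
toy_lexicon <- tibble(word = c("good", "bad", "terrible"),
                      value = c(1, -1, -2))

# made-up documents, for illustration only
toy_docs <- tibble(doc = c(1, 2),
                   text = c("A good day with good news",
                            "Bad weather and terrible traffic"))

toy_scores <- toy_docs %>%
  unnest_tokens(word, text) %>%             # one row per word, lowercased
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  group_by(doc) %>%
  summarise(score = sum(value))             # total valence per document

toy_scores
```

Words that are absent from the lexicon simply drop out of the join, which is why the choice of lexicon matters so much for the result.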

Load data and packages

Before proceeding, we’ll load the remaining packages we will need for this tutorial.

library(kableExtra)
library(tidyverse) # loads dplyr, ggplot2, and others
library(readr) # more informative and easy way to import data
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
library(quanteda) # includes functions to implement Lexicoder
library(textdata)
library(academictwitteR) # for fetching Twitter data

First off: always check that you have the right working directory

getwd()
## [1] "/Users/marionlieutaud/Dropbox (Personal)/My Mac (MacBook-Pro-3.local)/Documents/GitHub/CTA-ED-exercise2"

In this exercise we’ll be using another new dataset. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation. You can see the names of the newspapers in the code below:

# This is a code chunk to show the code that collected the data using the twitter API, back in 2020. 
# You don't need to run this, and this chunk of code will be ignored when you knit to html, thanks to the 'eval=FALSE' command in the chunk option.

newspapers = c("TheSun", "DailyMailUK", "MetroUK", "DailyMirror", 
               "EveningStandard", "thetimes", "Telegraph", "guardian")

tweets <-
  get_all_tweets(
    users = newspapers,
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2020-05-01T00:00:00Z",
    data_path = "data/sentanalysis/",
    n = Inf
  )

tweets <- 
  bind_tweets(data_path = "data/sentanalysis/", output_format = "tidy")

saveRDS(tweets, "data/sentanalysis/newstweets.rds")

You can download the tweet data directly from its source as follows. The data were collected by Chris Barrie and are stored on his GitHub page.

tweets  <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true")))

Inspect and filter data

Let’s have a look at the data:

head(tweets)
## # A tibble: 6 × 31
##   tweet_id         user_username text  lang  author_id source possibly_sensitive
##   <chr>            <chr>         <chr> <chr> <chr>     <chr>  <lgl>             
## 1 121233440226652… DailyMirror   "Sec… en    16887175  Tweet… FALSE             
## 2 121233416945767… DailyMirror   "RT … en    16887175  Tweet… FALSE             
## 3 121233319587999… thetimes      "A c… en    6107422   Echob… FALSE             
## 4 121233319486498… TheSun        "Way… en    34655603  Echob… FALSE             
## 5 121233292050719… DailyMailUK   "Stu… en    111556423 Socia… FALSE             
## 6 121233264057087… TheSun        "Dad… en    34655603  Twitt… FALSE             
## # ℹ 24 more variables: conversation_id <chr>, created_at <chr>, user_url <chr>,
## #   user_location <chr>, user_protected <lgl>, user_verified <lgl>,
## #   user_name <chr>, user_profile_image_url <chr>, user_description <chr>,
## #   user_created_at <chr>, user_pinned_tweet_id <chr>, retweet_count <int>,
## #   like_count <int>, quote_count <int>, user_tweet_count <int>,
## #   user_list_count <int>, user_followers_count <int>,
## #   user_following_count <int>, sourcetweet_type <chr>, sourcetweet_id <chr>, …
colnames(tweets)
##  [1] "tweet_id"               "user_username"          "text"                  
##  [4] "lang"                   "author_id"              "source"                
##  [7] "possibly_sensitive"     "conversation_id"        "created_at"            
## [10] "user_url"               "user_location"          "user_protected"        
## [13] "user_verified"          "user_name"              "user_profile_image_url"
## [16] "user_description"       "user_created_at"        "user_pinned_tweet_id"  
## [19] "retweet_count"          "like_count"             "quote_count"           
## [22] "user_tweet_count"       "user_list_count"        "user_followers_count"  
## [25] "user_following_count"   "sourcetweet_type"       "sourcetweet_id"        
## [28] "sourcetweet_text"       "sourcetweet_lang"       "sourcetweet_author_id" 
## [31] "in_reply_to_user_id"

Each row here is a tweet produced by one of the news outlets detailed above over a four-month period, January to April 2020. Note also that each tweet has a timestamp (created_at). We can therefore use these to look at any changes over time.
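
As a small sketch of how the timestamps can be used, assuming made-up rows that mimic the created_at format shown above (toy_tweets is hypothetical), as.Date() keeps the leading year-month-day part of the string, which is enough to count tweets per day:

```r
library(dplyr)

# made-up rows mimicking the created_at format in the data
toy_tweets <- tibble(
  created_at = c("2020-04-29T08:15:00.000Z",
                 "2020-04-30T23:43:24.000Z",
                 "2020-04-30T23:57:09.000Z"))

toy_daily <- toy_tweets %>%
  mutate(date = as.Date(created_at)) %>% # keeps the leading YYYY-MM-DD part
  count(date)                            # number of tweets per day

toy_daily
```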

We won’t need all of these variables so let’s just keep those that are of interest to us:

tweets <- tweets %>%
  select(user_username, text, created_at, user_name,
         retweet_count, like_count, quote_count) %>%
  rename(username = user_username,
         newspaper = user_name,
         tweet = text)
| username | tweet | created_at | newspaper | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|
| EveningStandard | We can’t complain: Two men spend coronavirus lockdown in London pub with ‘fresh beer on tap’ 🍺 https://t.co/rG65nGWv6q | 2020-04-30T23:43:24.000Z | Evening Standard | 3 | 4 | 0 |
| EveningStandard | Best home spa treatments: face, body, nail and hair products for home https://t.co/nDZ65BbbVs | 2020-04-30T23:57:09.000Z | Evening Standard | 0 | 2 | 0 |
| guardian | Coronavirus live news: Trump claims to have evidence virus started in Wuhan lab as UK is ‘past the peak’ https://t.co/LZv4yx2kn2 | 2020-04-30T23:57:59.000Z | The Guardian | 19 | 40 | 8 |
| guardian | Rugby league gets £16m emergency loan from government https://t.co/kZ9PP4aWjO | 2020-04-30T23:57:59.000Z | The Guardian | 9 | 12 | 0 |
| guardian | Coronavirus latest: at a glance https://t.co/OrWrEdOwoU | 2020-04-30T23:58:01.000Z | The Guardian | 8 | 11 | 3 |

We manipulate the data into tidy format again, unnesting each token (here: words) from the tweet text.

tidy_tweets <- tweets %>% 
  mutate(desc = tolower(tweet)) %>%
  unnest_tokens(word, desc) %>%
  filter(str_detect(word, "[a-z]"))

We’ll then tidy this further, as in the previous example, by removing stop words:

tidy_tweets <- tidy_tweets %>%
    filter(!word %in% stop_words$word)
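
To see the two tidying steps on a small scale, here is a sketch using a single made-up sentence (toy_tweet is hypothetical). It relies only on unnest_tokens() and the stop_words data frame that ships with tidytext:

```r
library(dplyr)
library(tidytext)

# one made-up tweet, for illustration only
toy_tweet <- tibble(tweet = "The pandemic is the story of the year")

toy_words <- toy_tweet %>%
  unnest_tokens(word, tweet) %>%       # lowercases and splits into words
  filter(!word %in% stop_words$word)   # drop common function words

toy_words
```

Note that unnest_tokens() lowercases by default, so function words such as “The” are matched against the stop word list regardless of their original case.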

Get sentiment dictionaries

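Several sentiment dictionaries can be accessed through the tidytext package. These include AFINN, bing, and nrc (AFINN and nrc are downloaded via the textdata package the first time they are requested).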

We can have a look at some of these to see how the relevant dictionaries are stored.

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

What do we see here? First, the AFINN lexicon gives words a score from -5 to +5, where more negative scores indicate more negative sentiment and more positive scores indicate more positive sentiment. The nrc lexicon assigns words to categories rather than scores: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, with each word coded as either belonging to a category or not. As a result, words appear multiple times in the nrc lexicon if they are associated with more than one category (see, e.g., “abandon” above). The bing lexicon is the most minimal, classifying words simply as “positive” or “negative”.
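
The difference between a scored lexicon and a categorical one can be sketched with toy excerpts. The entries below are copied from the printouts above, but the objects (toy_afinn, toy_nrc, toy_words) are hypothetical stand-ins, not the full lexicons:

```r
library(dplyr)

# toy excerpts in two of the formats (entries taken from the printouts above)
toy_afinn <- tibble(word = c("abandon", "abhor"), value = c(-2, -3))
toy_nrc   <- tibble(word = c("abandon", "abandon", "abandon", "abacus"),
                    sentiment = c("fear", "negative", "sadness", "trust"))

# made-up tokens to score
toy_words <- tibble(word = c("abandon", "abhor", "abacus"))

# scored lexicon: sum the values of the matched words
toy_afinn_score <- toy_words %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# categorical lexicon: count matches per category
# ("abandon" contributes to three categories at once)
toy_nrc_counts <- toy_words %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment)

toy_afinn_score
toy_nrc_counts
```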

Let’s see how we might filter the texts by selecting a dictionary, or a subset of a dictionary, and using inner_join() to filter the tweet data. We might, for example, be interested in fear words. We might hypothesize that there is an uptick in fear toward the beginning of the coronavirus outbreak. First, let’s have a look at the words in our tweet data that the nrc lexicon codes as fear-related words.

nrc_fear <- get_sentiments("nrc") %>% 
  filter(sentiment == "fear")

tidy_tweets %>%
  inner_join(nrc_fear) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 1,173 × 2
##    word           n
##    <chr>      <int>
##  1 mum         4509
##  2 death       4073
##  3 police      3275
##  4 hospital    2240
##  5 government  2179
##  6 pandemic    1877
##  7 fight       1309
##  8 die         1199
##  9 attack      1099
## 10 murder      1064
## # ℹ 1,163 more rows

We have a total of 1,173 distinct words with some fear valence in our tweet data according to the nrc classification. Several seem reasonable (e.g., “death,” “pandemic”); others seem less so (e.g., “mum,” “fight”).
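
One possible response to questionable entries, sketched here with a hypothetical stand-in for the fear lexicon (toy_fear) and an analyst-chosen exclusion list (not_fear_here), is to drop those words before joining. Which words to exclude is a judgment call for the corpus at hand, not something the lexicon decides for you:

```r
library(dplyr)

# hypothetical stand-in for the fear subset of a lexicon
toy_fear <- tibble(word = c("death", "pandemic", "mum", "fight"),
                   sentiment = "fear")

# words judged, for this corpus, not to signal fear (an analyst's choice)
not_fear_here <- c("mum")

toy_fear_clean <- toy_fear %>%
  filter(!word %in% not_fear_here)

toy_fear_clean
```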

Domain-specific lexicons

Of course, list- or dictionary-based methods need not only focus on sentiment, even if this is one of their most common uses. In essence, what you’ll have seen from the above is that sentiment analysis techniques rely on a given lexicon and score words appropriately. And there is nothing stopping us from making our own dictionaries, whether they measure sentiment or not. In the data above, we might be interested, for example, in the prevalence of mortality-related words in the news. As such, we might choose to make our own dictionary of terms. What would this look like?

A very minimal example would choose, for example, words like “death” and its synonyms and score these all as 1. We would then combine these into a dictionary, which we’ve called “mordict” here.

word <- c('death', 'illness', 'hospital', 'life', 'health',
             'fatality', 'morbidity', 'deadly', 'dead', 'victim')
value <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
mordict <- data.frame(word, value)
mordict
##         word value
## 1      death     1
## 2    illness     1
## 3   hospital     1
## 4       life     1
## 5     health     1
## 6   fatality     1
## 7  morbidity     1
## 8     deadly     1
## 9       dead     1
## 10    victim     1

We could then use the same technique as above to bind these with our data and look at the incidence of such words over time. Combining the sequence of scripts from above we would do the following:

tidy_tweets %>%
  mutate(date = as.Date(created_at)) %>% # day on which each tweet was posted
  inner_join(mordict) %>%
  group_by(date) %>%
  summarise(morwords = sum(value)) %>% # number of mortality words per day
  ggplot(aes(date, morwords)) +
  geom_bar(stat= "identity") +
  ylab("mortality words")
## Joining with `by = join_by(word)`

The above simply counts the number of mortality words over time. This might be misleading if there are, for example, more or longer tweets at certain points in time; i.e., if the length or quantity of text is not time-constant.

Why would this matter? Well, in the above it could just be that we have more mortality words later on because there are simply more tweets later on. By just counting words, we are not taking into account the denominator.

An alternative, and preferable, method is to store the relevant words as a character vector. We first sum the total number of words across all tweets for each date. We then keep only the tweet words that match a mortality word in the dictionary we have constructed, and sum the number of times these appear for each date as well.

After this, we join with our data frame of total words for each date. Note that here we are using full_join() as we want to include dates that appear in the “totals” data frame that do not appear when we filter for mortality words; i.e., days when mortality words are equal to 0. We then go about plotting as before.

mordict <- c('death', 'illness', 'hospital', 'life', 'health',
             'fatality', 'morbidity', 'deadly', 'dead', 'victim')

# derive the posting date from the created_at timestamp
tidy_tweets <- tidy_tweets %>%
  mutate(date = as.Date(created_at))

# get total words per day (no missing dates so no date completion required)
totals <- tidy_tweets %>%
  mutate(obs=1) %>%
  group_by(date) %>%
  summarise(sum_words = sum(obs))

# plot
# note: grepl() matches substrings, so "life" also matches e.g. "lifestyle"
tidy_tweets %>%
  mutate(obs=1) %>%
  filter(grepl(paste0(mordict, collapse = "|"), word, ignore.case = TRUE)) %>%
  group_by(date) %>%
  summarise(sum_mwords = sum(obs)) %>%
  full_join(totals, by="date") %>%
  mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),
         pctmwords = 100 * sum_mwords/sum_words) %>% # share of all words, in percent
  ggplot(aes(date, pctmwords)) +
  geom_point(alpha=0.5) +
  geom_smooth(method= "loess", alpha=0.25) +
  xlab("Date") + ylab("% mortality words")
## `geom_smooth()` using formula = 'y ~ x'
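
The role of the denominator and of full_join() can be checked on a toy example. The numbers below are made up (toy_totals and toy_hits are hypothetical), and the share is expressed as a percentage. Raw counts double from the first day to the second, yet the share stays constant, and the third day, which has no mortality words, is kept with a value of 0:

```r
library(dplyr)

# made-up daily totals and daily dictionary hits
toy_totals <- tibble(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
                     sum_words = c(100, 200, 50))
toy_hits <- tibble(date = as.Date(c("2020-01-01", "2020-01-02")),
                   sum_mwords = c(5, 10))

toy_share <- toy_hits %>%
  full_join(toy_totals, by = "date") %>%   # keeps the day with no hits
  mutate(sum_mwords = ifelse(is.na(sum_mwords), 0, sum_mwords),
         pctmwords = 100 * sum_mwords / sum_words) %>%
  arrange(date)

toy_share
```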

Using Lexicoder

The above approaches use general-purpose dictionaries that were not designed for domain-specific text such as news text. The Lexicoder Sentiment Dictionary, by Young and Soroka (2012), was designed specifically for examining the affective content of news text. In what follows, we will see how to implement an analysis using this dictionary.

We will conduct the analysis using the quanteda package. You will see that we can tokenize text in a similar way using functions included in the quanteda package.

With the quanteda package we first need to create a “corpus” object from our tweets. Here, we make sure our date column is correctly stored and then create the corpus with the corpus() function. Note that we specify the text_field as “tweet”, as this is where our text data of interest is. When corpus() is given a data frame, every other column is kept automatically as a document variable, or “docvar”; there is no docvars argument to set. The corpus therefore consists of the text plus the remaining variables (columns) of the original dataset, including the date on which each tweet was published.

tweets$date <- as.Date(tweets$created_at)
tweet_corpus <- corpus(tweets, text_field = "tweet")
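
A minimal sketch with made-up data (toy_df is hypothetical) shows what ends up in the docvars when corpus() is given a data frame: the columns other than the text field are carried along and can be inspected with docvars().

```r
library(quanteda)

# made-up data frame with a text column and two other columns
toy_df <- data.frame(
  tweet = c("First made-up tweet", "Second made-up tweet"),
  date = as.Date(c("2020-01-01", "2020-01-02")),
  newspaper = c("Paper A", "Paper B"))

toy_corpus <- corpus(toy_df, text_field = "tweet")

names(docvars(toy_corpus)) # the non-text columns
ndoc(toy_corpus)           # number of documents
```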

We then tokenize our text using the tokens() function from quanteda, removing punctuation along the way:

toks_news <- tokens(tweet_corpus, remove_punct = TRUE)

We then take the data_dictionary_LSD2015 that comes bundled with quanteda and select only the positive and negative categories, leaving out the dictionary’s two categories for negated terms (neg_positive and neg_negative). After this, we are ready to “look up” in this dictionary how the tokens in our corpus are scored with the tokens_lookup() function.

# select only the "negative" and "positive" categories
data_dictionary_LSD2015_pos_neg <- data_dictionary_LSD2015[1:2]
toks_news_lsd <- tokens_lookup(toks_news, dictionary = data_dictionary_LSD2015_pos_neg)

This creates a long list of all the texts (tweets) annotated with a series of ‘positive’ or ‘negative’ annotations depending on the valence of the words in that text. The creators of quanteda then recommend we generate a document-feature matrix from this. Grouping by date, we then get a dfm object, a sparse matrix of counts that we can plot using base graphics functions for plotting matrices.

# create a document-feature matrix and group it by date
dfmat_news_lsd <- dfm(toks_news_lsd) %>% 
  dfm_group(groups = date)

# plot positive and negative valence over time
matplot(dfmat_news_lsd$date, dfmat_news_lsd, type = "l", lty = 1, col = 1:2,
        ylab = "Frequency", xlab = "")
grid()
legend("topleft", col = 1:2, legend = colnames(dfmat_news_lsd), lty = 1, bg = "white")

# plot overall sentiment (positive  - negative) over time

plot(dfmat_news_lsd$date, dfmat_news_lsd[,"positive"] - dfmat_news_lsd[,"negative"], 
     type = "l", ylab = "Sentiment", xlab = "")
grid()
abline(h = 0, lty = 2)
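
To see what tokens_lookup() and dfm() produce on a small scale, here is a sketch that uses a hypothetical two-category dictionary (toy_dict) and two made-up texts rather than the Lexicoder dictionary itself:

```r
library(quanteda)

# hypothetical dictionary with two categories
toy_dict <- dictionary(list(positive = c("good", "great"),
                            negative = c("bad", "awful")))

# made-up documents
toy_toks <- tokens(c(d1 = "A good day, a great result",
                     d2 = "Bad news and awful weather"),
                   remove_punct = TRUE)

# replace each matched token by its dictionary category, drop the rest
toy_lsd <- tokens_lookup(toy_toks, dictionary = toy_dict)

# count categories per document and compute positive minus negative
toy_counts <- convert(dfm(toy_lsd), to = "data.frame")
toy_counts$sentiment <- toy_counts$positive - toy_counts$negative

toy_counts
```

Matching is case-insensitive by default, so “Bad” is counted under the negative category.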

Alternatively, we can recreate this in tidy format as follows:

# convert the grouped dfm to a data frame instead of indexing its internal slots
tidy_sent <- dfmat_news_lsd %>%
  convert(to = "data.frame") %>% # one row per date, with columns doc_id, negative and positive
  rename(date = doc_id) %>% # after grouping by date, doc_id holds the date
  mutate(date = as.Date(date),
         sentiment = positive - negative)

And plot accordingly:

tidy_sent %>%
  ggplot() +
  geom_line(aes(date, sentiment))

Exercises

  1. Take a subset of the tweets data by “user_name” (renamed “newspaper” above). These names give the newspaper behind each Twitter account. Do we see different sentiment dynamics if we look only at particular newspaper sources?

First the subsetting and some preprocessing

# to subset means to take only a sample according to a specific condition. Here I'm going to look only at tabloid media. 
tweets_tabloid <- tweets %>%
  filter(newspaper %in% c("The Mirror", "The Sun", "Daily Mail U.K.", "Metro"))

# create corpus
tweets_tabloid_corpus <- corpus(tweets_tabloid, text_field = "tweet")

# check that all docvars have been correctly recognised
names(docvars(tweets_tabloid_corpus)) 
## [1] "username"      "created_at"    "newspaper"     "retweet_count"
## [5] "like_count"    "quote_count"   "date"
# tokenising and tidying
toks_tabloid <- tokens(tweets_tabloid_corpus, 
                       remove_punct = TRUE, # remove punctuation
                       remove_url = TRUE, # remove urls
                       remove_numbers = TRUE, # remove numbers
                       remove_symbols = TRUE) %>% # remove symbols
  tokens_select(pattern = stopwords("en"), selection = "remove") %>% # remove stopwords
  tokens_tolower()

We will need a denominator for the frequency of our sentiment words, so we need to calculate a total for each tabloid. We could calculate those totals at different stages, that is, after different degrees of preprocessing. Here I want the total number of tokens after preprocessing (i.e. without punctuation, stopwords or URLs), so I base the total on the tokens object we just created (toks_tabloid).

# now calculate total tokens for each newspaper
total_dfm_tabloid <- dfm(toks_tabloid) %>%
  dfm_group(groups = newspaper) %>% # group the dfm by newspaper
  convert(to = "data.frame") %>% # convert to data frame so it's easier to manipulate
  group_by(doc_id) %>% # group the data frame by newspaper for the calculation
  reframe(total = rowSums(across(everything()))) # calculate total for each row (total tokens)

# have a look at the first rows to check all looks good
head(total_dfm_tabloid)
## # A tibble: 4 × 2
##   doc_id           total
##   <chr>            <dbl>
## 1 Daily Mail U.K. 191408
## 2 Metro            65434
## 3 The Mirror      463809
## 4 The Sun         291059
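
As an alternative sketch of the same denominator, assuming made-up data (toy_df is hypothetical), quanteda’s ntoken() returns the number of tokens per document, so applying it to a grouped dfm gives one total per newspaper directly:

```r
library(quanteda)

# made-up tweets from two hypothetical papers
toy_df <- data.frame(newspaper = c("Paper A", "Paper A", "Paper B"),
                     tweet = c("one two three", "four five", "six"))

toy_toks <- tokens(corpus(toy_df, text_field = "tweet"))

# group the dfm by newspaper, then count tokens per group
toy_grouped <- dfm_group(dfm(toy_toks), groups = newspaper)
toy_totals <- ntoken(toy_grouped)

toy_totals
```

This is only a shortcut for the total; the data frame route used above is convenient when the totals need to be joined to other tables.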

Now we move to the sentiment analysis. I use the NRC dictionary, but since I prefer to work in quanteda I first convert it into a quanteda dictionary object, which I can then pass to the quanteda function tokens_lookup().

# turn the NRC dictionary into a quanteda dictionary
data_dictionary_NRC <- get_sentiments("nrc")
data_dictionary_NRC <- as.dictionary(data_dictionary_NRC)

#get tweet sentiments by newspaper
toks_tabloid_nrc <- toks_tabloid %>%
  tokens_lookup(dictionary = data_dictionary_NRC)

# turn into document feature matrix (dfm)
dfm_tabloid_nrc <- dfm(toks_tabloid_nrc) %>% 
  dfm_group(groups = newspaper) %>%
  convert(to = "data.frame") # convert to data frame

# join with the totals by newspaper
dfm_tabloid_nrc <- dfm_tabloid_nrc %>% 
  full_join(total_dfm_tabloid, by="doc_id") %>%
  rename("newspaper" = doc_id) # rename the 'doc_id' column to 'newspaper'

# let's have a look at the numbers by newspaper and by sentiment
kable(dfm_tabloid_nrc)
| newspaper | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust | total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Daily Mail U.K. | 8167 | 8382 | 4303 | 11326 | 7215 | 16355 | 16528 | 8830 | 4476 | 10874 | 191408 |
| Metro | 1833 | 2861 | 1213 | 2935 | 2225 | 4090 | 5132 | 2200 | 1501 | 3307 | 65434 |
| The Mirror | 20732 | 19903 | 13258 | 33365 | 16807 | 44747 | 37089 | 23482 | 12292 | 23589 | 463809 |
| The Sun | 9879 | 13081 | 6383 | 15732 | 13235 | 22553 | 25230 | 10131 | 7739 | 13970 | 291059 |

Now we can calculate relative frequencies by sentiment and format for plotting. Using mutate() with across() allows you to modify multiple columns at once.

dfm_tabloid_nrc_pct <- dfm_tabloid_nrc %>%
  mutate(across(anger:trust, ~round((.x/total)*100, digits=1))) 
# here the instruction is that each row value (.x) in columns from "anger" to "trust" should be divided by the row total, multiplied by 100, and then rounded.
# you could also do it with simpler code, one column at a time, with more lines of code. It would look like this:
# mutate(anger = round((anger/total)*100, digits=1),
#        trust = round((trust/total)*100, digits=1)) # etc...

# we pivot the data to 'long' format to make it easier to plot
dfm_tabloid_nrc_pct <- dfm_tabloid_nrc_pct %>%
  select(-total) %>% # remove the 'total' column
  pivot_longer(c(anger:trust), names_to = "sentiment", values_to = "frequency")
# plot by newspaper
dfm_tabloid_nrc_pct %>%
  ggplot() + # when we enter ggplot environment we need to use '+' not '%>%'
  geom_col(aes(x=newspaper, y=frequency, group=sentiment, fill=newspaper)) + # one bar per newspaper, coloured by newspaper
  coord_flip() + # pivot plot by 90 degrees
  facet_wrap(~sentiment, nrow = 2) + # create one panel for each sentiment
  ylab("Sentiment relative frequency (%)") + # label y axis
  scale_fill_manual(values = c("blue", "darkblue", "red", "pink")) + # pick the colours
  guides(fill = "none") + # no need to show legend for colour 
  theme_minimal() # pretty graphic theme
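
The mutate() with across() step and the pivot to long format can be checked on two rows. The counts below are copied from the table above for two of the ten sentiment columns; toy_counts itself is a hypothetical cut-down object:

```r
library(dplyr)
library(tidyr)

# two rows and two sentiment columns copied from the table above
toy_counts <- tibble(newspaper = c("Daily Mail U.K.", "Metro"),
                     anger = c(8167, 1833),
                     trust = c(10874, 3307),
                     total = c(191408, 65434))

toy_pct <- toy_counts %>%
  # divide every sentiment column by the row total and express as a percentage
  mutate(across(anger:trust, ~ round(.x / total * 100, digits = 1))) %>%
  select(-total) %>%
  # one row per newspaper and sentiment, ready for facetting
  pivot_longer(anger:trust, names_to = "sentiment", values_to = "frequency")

toy_pct
```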

Tabloids don’t differ much on trust, surprise or anticipation. Relative to the others, the Mirror and the Daily Mail use more words associated with anger, fear and sadness, while the Sun uses slightly more joyful words. Overall, the Mirror and the Daily Mail are more negative and use a variety of negative sentiments (fear, anger etc.) more extensively than the other two. Metro seems less likely to use sentiment-laden words at all, as it shows a lower frequency in almost all sentiment categories. This may be a sign that the writing in Metro is somewhat less sensationalist.

  1. Build your own (minimal) dictionary-based filter technique and plot the result
# first we do this using only the tidyverse
# note: tidy_tweets holds single lowercase words, so multi-word entries cannot match
trans_words <- c('trans', 'transgender', 'trans rights', 'trans rights activists', 'transphobic', 'terf', 'terfs', 'transphobia', 'transphobes', 'gender critical', 'LGBTQ', 'LGBTQ+')

# get total words per newspaper
totals_newspaper <- tidy_tweets %>%
  mutate(obs=1) %>%
  group_by(newspaper) %>%
  summarise(sum_words = sum(obs))

#plot
tidy_tweets %>%
  mutate(obs=1) %>%
  # exact matching on whole tokens: a regex such as "trans|terf" would also
  # match unrelated words like "transfer" or "transport"
  filter(word %in% tolower(trans_words)) %>%
  group_by(newspaper) %>%
  summarise(sum_mwords = sum(obs)) %>%
  full_join(totals_newspaper, by="newspaper") %>%
  mutate(sum_mwords= ifelse(is.na(sum_mwords), 0, sum_mwords),
         pcttranswords = 100 * sum_mwords/sum_words) %>% # share of all words, in percent
  ggplot(aes(x=reorder(newspaper, -pcttranswords), y=pcttranswords)) +
  geom_point() +
  xlab("newspaper") + ylab("% words referring to trans or terfs") +
  coord_flip() +
  theme_minimal()

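The Sun appears to use these terms most often. Treat this with caution, though: the result is sensitive to how terms are matched, since a pattern such as “trans” also occurs inside unrelated words like “transfer” and “transport”.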
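
The matching rule matters for a filter like this. A small base R sketch with made-up tokens (toy_words and toy_terms are hypothetical) compares a regular-expression match, which looks for the pattern anywhere inside a word, with exact matching on whole tokens:

```r
# made-up tokens and dictionary terms
toy_words <- c("transfer", "transgender", "transport", "trans")
toy_terms <- c("trans", "transgender")

# regular-expression match: TRUE whenever a term occurs anywhere in the word
substring_hits <- grepl(paste0(toy_terms, collapse = "|"), toy_words)

# exact match: TRUE only when the whole token is one of the terms
exact_hits <- toy_words %in% toy_terms

data.frame(word = toy_words, substring_hits, exact_hits)
```

Under the regular expression all four tokens count as hits, whereas exact matching keeps only “transgender” and “trans”.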

  1. Apply the Lexicoder Sentiment Dictionary to the news tweets, but break down the analysis by newspaper
# we go back to the initial corpus
toks_news <- tokens(tweet_corpus, 
                    remove_punct = TRUE,
                    remove_url = TRUE,
                    remove_numbers = TRUE,
                    remove_symbols = TRUE) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

toks_news_lsd <- tokens_lookup(toks_news, 
                               dictionary = data_dictionary_LSD2015_pos_neg)

# recreate a document-feature matrix but instead of grouping it by date, we group it by 'username' (aka newspapers)
dfm_news_lsd <- dfm(toks_news_lsd) %>% 
  dfm_group(groups = username) 

# convert it to a dataframe so it's easier to use
tidy_dfm_news_lsd <- dfm_news_lsd %>%
  convert(to = "data.frame") %>%
  rename("newspaper" = doc_id) %>% # when converting to data.frame, R called our grouping variable 'doc_id'. We rename it 'newspaper' instead.
  mutate(sentiment = positive - negative) # create variable for overall sentiment

# plot by newspaper
tidy_dfm_news_lsd %>%
  ggplot() + # when we enter ggplot environment we need to use '+' not '%>%', 
  geom_point(aes(x=reorder(newspaper, -sentiment), y=sentiment)) + # reordering newspaper variable so it is displayed from most negative to most positive
  coord_flip() + # pivot plot by 90 degrees
  xlab("Newspapers") + # label x axis
  ylab("Overall tweet sentiment (negative to positive)") + # label y axis
  theme_minimal() # pretty graphic theme

Difficult to interpret… Tabloids (the Daily Mirror, the Sun and the Daily Mail) seem to write more negative tweets overall than more traditional newspapers. This is especially true for the Daily Mirror. Note, however, that the score is a raw difference between positive and negative word counts, so outlets that tweet more will tend to show larger absolute values. It may also be interesting to note that the more left-leaning papers (the Daily Mirror and the Guardian) appear the most negative within their respective genres (tabloids and non-tabloid newspapers).

Because many of you wanted to analyse sentiment not just by newspaper but by newspaper and date, I include code to do this.

# recreate a document-feature matrix but instead of grouping it just by date or just by newspaper, we group it by both (we interact the two)
dfm_news_lsd <- dfm(toks_news_lsd) %>% 
  dfm_group(groups = interaction(username, date)) # we group by interaction variable between newspaper and date

# convert it to a dataframe so it's easier to use
tidy_dfm_news_lsd <- dfm_news_lsd %>%
  convert(to = "data.frame") 

# the interaction has batched together newspaper name and date (e.g. DailyMailUK.2020-01-01). 

# We want to separate them into two distinct variables. We can do it using the command extract() and regex. It's easy because the separation is always a .
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
  extract(doc_id, into = c("newspaper", "date"), regex = "([a-zA-Z]+)\\.(.+)") 

# nice! now we again have two distinct clean variables called 'newspaper' and 'date'.

# arrange by date
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
  mutate(date = as.Date(date)) %>% # clarify to R this is a date
  arrange(date) 

# recreate variable for overall sentiment
tidy_dfm_news_lsd <- tidy_dfm_news_lsd %>%
  mutate(sentiment = positive - negative) 

# plot
tidy_dfm_news_lsd %>%
  ggplot(aes(x=date, y=sentiment)) +
  geom_point(alpha=0.5) + # plot points
  geom_smooth(method= loess, alpha=0.25) + # plot smooth line
  facet_wrap(~newspaper, nrow = 2) + # 'faceting' means drawing one plot for each member of the group (here, newspaper), so that you can easily compare trends across groups.
  xlab("date") + ylab("overall sentiment (negative to positive)") +
  ggtitle("Tweet sentiment trend across 8 British newspapers") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
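
The extract() step can be checked in isolation. This sketch assumes made-up document ids in the same “name.date” shape (toy_ids is hypothetical) and calls tidyr::extract() explicitly to avoid clashes with functions of the same name in other packages:

```r
library(tidyr)

# made-up ids in the form produced by interaction(username, date)
toy_ids <- data.frame(doc_id = c("DailyMailUK.2020-01-01", "guardian.2020-04-30"),
                      positive = c(10, 7),
                      negative = c(12, 3))

# first capture group: letters before the dot; second: everything after it
toy_split <- tidyr::extract(toy_ids, doc_id,
                            into = c("newspaper", "date"),
                            regex = "([a-zA-Z]+)\\.(.+)")
toy_split$date <- as.Date(toy_split$date)

toy_split
```

The pattern works here because the usernames contain only letters; a username with digits or underscores would need a broader first capture group.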

The Mirror is clearly more negative overall, but also more dispersed, whereas the Times, the Telegraph, the Guardian and Metro all show compact and stable sentiment over time. The increase in the use of positive words that we saw in the overall analysis seems to have been driven chiefly by the Mirror and the Sun.

  1. Don’t forget to ‘knit’ to produce your final html output for the exercise.